Python and R have a lot of similarities in the way they operate, but there are some slight differences.
Remember how I said computer languages are a lot like languages? Well, you’re about to become bilingual.
Depending on what you're most comfortable with, you may think about how to say it in your primary language and then translate it to your secondary/tertiary language.
Python and R are kind of like Italian and Spanish - they’re different, but if you know one really well, learning the other language is not super hard. It will take work to master it, of course, but the translation is similar enough that you can figure it out quickly if you can find the right words you need to use.
A great resource listed in the syllabus is R for Data Science, which is freely available online here - I rely on it heavily in the lecture. Another good option, YaRrr! The Pirate's Guide to R, is here.
R packages are stored on a network of servers across the world called the Comprehensive R Archive Network (CRAN) - each server (a "mirror") stores its own copy of exactly the same information.
So, likely, the first thing you’ll need to do is to use a mirror near you - a list of mirrors can be found here
Once you set it, it’s done. No need to go back to this step unless for some reason you want to pull from a different server (maybe if you move, one that is closer to you)
options("repos" = c(CRAN = "http://lib.stat.cmu.edu/R/CRAN/"))
We load packages with library(), such as:
`library(package_name)`
You have to call the package that you want in your notebook. In a moment, we’re going to start working with tidyverse and dplyr. But, just to see the importing of a package, let’s try it here.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
If you want to install a package, you can do so like this (note the quotes around the package name):
`install.packages("dplyr")`
Or you can install through the RStudio browser - I’ll show you that now.
Before we get to working with data, let’s do a quick overview of some basic commands.
To change the working directory we use:
setwd("/your/working/directory/")
To find out what your current working directory is, we use:
`getwd()`
getwd()
## [1] "/Users/mkaltenberg/Documents/GitHub/Data_Analysis_Python_R/New R Kids on the Block"
setwd('/Users/mkaltenberg/Documents/GitHub/Data_Analysis_Python_R/New R Kids on the Block/')
In R documentation they almost always use <- but = also works.
x <- 'hi there!'
x
## [1] "hi there!"
y= 'bye'
You can change a variable's value just as easily by reassigning the variable name.
(y <- seq(0, 10, 2))
## [1] 0 2 4 6 8 10
As in Python, some words are reserved and are best never used as variable names (so you don’t override core parts of R).
A full list can be found here
To read a csv file you use
read.csv(path_to_csv_file)
To save a csv file you use write.csv(df, path_to_csv_file)
#Don't forget to assign it!
jobs_r <- read.csv('job-automation-probability.csv')
Ok, now you. Go onto classes, download the file, and import it into RStudio now. If you have an error or other issues, share your screen and I can help you.
Exporting data is also easy:
write.csv(jobs_r, 'jobs2.csv')
You can always ask for documentation; the function for that is help().
help(seq)
help('read.csv')
For packages, vignette() also gives a summary of the package and what it does.
vignette('dplyr')
## starting httpd help server ... done
Objects can clog up our RAM, especially if they are large data files. If you want to remove an object, the function is rm().
rm(x,y)
You can also remove EVERYTHING in your environment with
rm(list = ls())
Just remember that rm(list = ls()) will delete everything and you’d have to import all of your data again.
A faster way to import data is the package readr - this becomes important with larger datasets where you want to efficiently read data into your computer.
library(readr)
jobs = read_csv('job-automation-probability.csv')
## Rows: 702 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): _ - code, education, occupation, short occupation, employed_may2016
## dbl (8): _ - rank, prob, Average annual wage, len, probability, numbEmployed...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# IF you want to specify the delimiter
jobs = read_delim('job-automation-probability.csv', delim = ',')
## Rows: 702 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): _ - code, education, occupation, short occupation, employed_may2016
## dbl (8): _ - rank, prob, Average annual wage, len, probability, numbEmployed...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
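You can compare the two ways we imported the dataset. They treat column names differently: read.csv rewrites them into valid R names (spaces and dashes become dots), while read_csv keeps them exactly as they appear in the file.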
#read.csv
jobs_r = read.csv('job-automation-probability.csv')
names(jobs_r)
## [1] "X_...rank" "X_...code" "prob"
## [4] "Average.annual.wage" "education" "occupation"
## [7] "short.occupation" "len" "probability"
## [10] "numbEmployed" "median_ann_wage" "employed_may2016"
## [13] "average_ann_wage"
#read_csv from readr
names(jobs)
## [1] "_ - rank" "_ - code" "prob"
## [4] "Average annual wage" "education" "occupation"
## [7] "short occupation" "len" "probability"
## [10] "numbEmployed" "median_ann_wage" "employed_may2016"
## [13] "average_ann_wage"
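If you are wondering where names like X_...rank come from: read.csv checks column names by default (check.names = TRUE) and adjusts them with base R's make.names(). Here is a minimal sketch that applies make.names() directly to a few of the header strings shown above, so you can see the translation without reading the file.

```r
# Header strings copied from the readr output above
raw_names <- c("_ - rank", "Average annual wage", "prob")

# make.names() turns each string into a syntactically valid R name:
# invalid characters become "." and "X" is prepended when the name
# starts with a character that a name is not allowed to start with
clean_names <- make.names(raw_names)
print(clean_names)
```

If you would rather keep the original names with read.csv, passing check.names = FALSE switches this adjustment off.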
You can also import a variety of other formats, like Stata, SAS and SPSS files, with haven.
library(haven)
#read_sas("mtcars.sas7bdat")
#write_sas(mtcars, "mtcars.sas7bdat")
#read_sav("mtcars.sav")
#write_sav(mtcars, "mtcars.sav")
#read_dta("mtcars.dta")
#write_dta(mtcars, "mtcars.dta")
I’ll show some of the commands we used in Python.
Generally, you’re going to tell R (1) which dataframe you are manipulating and (2) the function you want to apply.
library(dplyr)
To get the names of the columns in the data frame
python jobs.columns
R names(df)
names(jobs)
## [1] "_ - rank" "_ - code" "prob"
## [4] "Average annual wage" "education" "occupation"
## [7] "short occupation" "len" "probability"
## [10] "numbEmployed" "median_ann_wage" "employed_may2016"
## [13] "average_ann_wage"
Select some of the columns:
python
jobs[['_ - code', 'prob', 'Average annual wage', 'education', 'numbEmployed']]
R
jobs %>% select(c('_ - code', 'prob', 'Average annual wage',
'education', 'numbEmployed'))
## # A tibble: 702 × 5
## `_ - code` prob `Average annual wage` education numbEmployed
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 51-4033 0.95 34920 High school diploma or e… 74600
## 2 51-9012 0.88 41450 High school diploma or e… 47160
## 3 41-4012 0.85 68410 High school diploma or e… 1404050
## 4 53-1031 0.029 59800 High school diploma or e… 202760
## 5 51-4072 0.95 32660 High school diploma or e… 145560
## 6 51-6091 0.88 35420 High school diploma or e… 19340
## 7 51-4031 0.78 34210 High school diploma or e… 192800
## 8 41-4011 0.25 92910 Bachelor's degree 328370
## 9 51-4032 0.94 38880 High school diploma or e… 12290
## 10 51-9041 0.93 34370 High school diploma or e… 71260
## # ℹ 692 more rows
A simpler way to do this in R:
select(jobs, c('_ - code', 'prob', 'Average annual wage',
'education', 'numbEmployed'))
## # A tibble: 702 × 5
## `_ - code` prob `Average annual wage` education numbEmployed
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 51-4033 0.95 34920 High school diploma or e… 74600
## 2 51-9012 0.88 41450 High school diploma or e… 47160
## 3 41-4012 0.85 68410 High school diploma or e… 1404050
## 4 53-1031 0.029 59800 High school diploma or e… 202760
## 5 51-4072 0.95 32660 High school diploma or e… 145560
## 6 51-6091 0.88 35420 High school diploma or e… 19340
## 7 51-4031 0.78 34210 High school diploma or e… 192800
## 8 41-4011 0.25 92910 Bachelor's degree 328370
## 9 51-4032 0.94 38880 High school diploma or e… 12290
## 10 51-9041 0.93 34370 High school diploma or e… 71260
## # ℹ 692 more rows
There are multiple ways of calling a dataframe and applying a function.
The first way is df %>% select(c(col_names)).
This funny thing, %>% (called a pipe), says that I am going to work with the dataframe named df and I want you to apply a function called select to it.
I find it much more intuitive to use select(df, c(column_names)),
where I have a function called select and I’m telling it that the dataframe named df is what the function applies to.
Because I have a preference, we'll stick to the latter form in the rest of the lecture - but when you look at Stack Overflow and get confused by other notation, recall that it’s the same-ish.
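To convince yourself the two notations really are the same-ish, here is a small sketch on a toy dataframe. The dataframe df and its columns are made up for illustration; they are not the jobs data.

```r
library(dplyr)

# Toy dataframe invented for illustration (not the jobs data)
df <- data.frame(
  code = c("a", "b", "c"),
  prob = c(0.10, 0.90, 0.50),
  wage = c(100, 200, 300)
)

# Pipe form: df is handed to select() as its first argument
piped <- df %>% select(code, prob)

# Function-call form: df is written out as the first argument
called <- select(df, code, prob)

# Compare the two results
print(all.equal(piped, called))
```

The pipe starts to pay off when you chain several steps, because each step reads left to right instead of being nested inside the previous one.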
To drop columns, you just include the negative sign before the column list, and that will drop the columns you listed.
names(jobs)
## [1] "_ - rank" "_ - code" "prob"
## [4] "Average annual wage" "education" "occupation"
## [7] "short occupation" "len" "probability"
## [10] "numbEmployed" "median_ann_wage" "employed_may2016"
## [13] "average_ann_wage"
select(jobs, -c('probability','_ - rank','employed_may2016' ,'average_ann_wage','len'))
## # A tibble: 702 × 8
## `_ - code` prob `Average annual wage` education occupation
## <chr> <dbl> <dbl> <chr> <chr>
## 1 51-4033 0.95 34920 High school diploma or equ… Grinding,…
## 2 51-9012 0.88 41450 High school diploma or equ… Separatin…
## 3 41-4012 0.85 68410 High school diploma or equ… Sales Rep…
## 4 53-1031 0.029 59800 High school diploma or equ… First-Lin…
## 5 51-4072 0.95 32660 High school diploma or equ… Molding, …
## 6 51-6091 0.88 35420 High school diploma or equ… Extruding…
## 7 51-4031 0.78 34210 High school diploma or equ… Cutting, …
## 8 41-4011 0.25 92910 Bachelor's degree Sales Rep…
## 9 51-4032 0.94 38880 High school diploma or equ… Drilling …
## 10 51-9041 0.93 34370 High school diploma or equ… Extruding…
## # ℹ 692 more rows
## # ℹ 3 more variables: `short occupation` <chr>, numbEmployed <dbl>,
## # median_ann_wage <dbl>
The boolean operators work the same way in R as they do in Python (and most other languages).
This handy chart can help you figure out what boolean operators you want to use
python
`jobs[jobs['prob'] > .8]`
R
filter(jobs, prob >.8)
## # A tibble: 262 × 13
## `_ - rank` `_ - code` prob `Average annual wage` education occupation
## <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 624 51-4033 0.95 34920 High school dip… Grinding,…
## 2 517 51-9012 0.88 41450 High school dip… Separatin…
## 3 484 41-4012 0.85 68410 High school dip… Sales Rep…
## 4 620 51-4072 0.95 32660 High school dip… Molding, …
## 5 518 51-6091 0.88 35420 High school dip… Extruding…
## 6 590 51-4032 0.94 38880 High school dip… Drilling …
## 7 584 51-9041 0.93 34370 High school dip… Extruding…
## 8 477 51-4034 0.84 39630 High school dip… Lathe and…
## 9 560 51-4021 0.91 35340 High school dip… Extruding…
## 10 637 51-6064 0.96 28110 High school dip… TextileWi…
## # ℹ 252 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## # numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## # average_ann_wage <dbl>
Python jobs[jobs['education'] == 'High school diploma or equivalent']
R filter(df, column == 'value')
filter(jobs, education == 'High school diploma or equivalent')
## # A tibble: 307 × 13
## `_ - rank` `_ - code` prob `Average annual wage` education occupation
## <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 624 51-4033 0.95 34920 High school dip… Grinding,…
## 2 517 51-9012 0.88 41450 High school dip… Separatin…
## 3 484 41-4012 0.85 68410 High school dip… Sales Rep…
## 4 105 53-1031 0.029 59800 High school dip… First-Lin…
## 5 620 51-4072 0.95 32660 High school dip… Molding, …
## 6 518 51-6091 0.88 35420 High school dip… Extruding…
## 7 427 51-4031 0.78 34210 High school dip… Cutting, …
## 8 590 51-4032 0.94 38880 High school dip… Drilling …
## 9 584 51-9041 0.93 34370 High school dip… Extruding…
## 10 477 51-4034 0.84 39630 High school dip… Lathe and…
## # ℹ 297 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## # numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## # average_ann_wage <dbl>
In pandas, we often used a tilde (~) to exclude something. In R you use an exclamation mark (!).
#python jobs[~((jobs['education'] == 'High school diploma or equivalent') | (jobs['education'] == 'No formal educational credential'))]
R filter(df, !(column == 'value' | column == 'value'))
filter(jobs, !(education == 'High school diploma or equivalent'
| education =='No formal educational credential'))
## # A tibble: 297 × 13
## `_ - rank` `_ - code` prob `Average annual wage` education occupation
## <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 228 41-4011 0.25 92910 Bachelor's deg… Sales Rep…
## 2 554 49-2093 0.91 59840 Postsecondary … Electrica…
## 3 208 15-1179 0.21 67770 Associate's de… Informati…
## 4 254 49-2022 0.36 54520 Postsecondary … Telecommu…
## 5 103 17-2111 0.028 90190 Bachelor's deg… Health an…
## 6 205 25-3011 0.19 55140 Bachelor's deg… Adult Bas…
## 7 277 49-2094 0.41 56990 Postsecondary … Electrica…
## 8 41 25-2031 0.0078 61420 Bachelor's deg… Secondary…
## 9 261 49-2095 0.38 74540 Postsecondary … Electrica…
## 10 200 25-2022 0.17 59800 Bachelor's deg… Middle Sc…
## # ℹ 287 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## # numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## # average_ann_wage <dbl>
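The filters above can be tried out on something small enough to check by eye. This is a minimal sketch on a made-up dataframe (df, job, educ and prob are all invented for illustration). It also shows %in%, which does the same job as a chain of | comparisons on one column, and the fact that conditions separated by commas are combined with AND.

```r
library(dplyr)

# Toy data invented for illustration
df <- data.frame(
  job  = c("w", "x", "y", "z"),
  educ = c("High school", "Bachelor's", "None", "Master's"),
  prob = c(0.9, 0.2, 0.95, 0.4)
)

# Negating an OR: keep rows where educ is neither "High school" nor "None"
not_either <- filter(df, !(educ == "High school" | educ == "None"))

# The same filter written with %in%, which is easier to extend to long lists
not_in <- filter(df, !(educ %in% c("High school", "None")))

# Conditions separated by commas must all be TRUE for a row to be kept
risky_masters <- filter(df, prob > 0.3, educ == "Master's")

print(not_either)
print(risky_masters)
```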
And our good friend, is.na()
python
jobs[jobs['prob'].isnull()]
Note: in R, a missing value is written NA. The R version is filter(df, is.na(column)).
filter(jobs, is.na(prob))
## # A tibble: 0 × 13
## # ℹ 13 variables: _ - rank <dbl>, _ - code <chr>, prob <dbl>,
## # Average annual wage <dbl>, education <chr>, occupation <chr>,
## # short occupation <chr>, len <dbl>, probability <dbl>, numbEmployed <dbl>,
## # median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>
And to drop rows with NA values:
python jobs.dropna(subset=['prob'])
R filter(df, !is.na(column))
filter(jobs, !is.na(prob))
## # A tibble: 702 × 13
## `_ - rank` `_ - code` prob `Average annual wage` education occupation
## <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 624 51-4033 0.95 34920 High school dip… Grinding,…
## 2 517 51-9012 0.88 41450 High school dip… Separatin…
## 3 484 41-4012 0.85 68410 High school dip… Sales Rep…
## 4 105 53-1031 0.029 59800 High school dip… First-Lin…
## 5 620 51-4072 0.95 32660 High school dip… Molding, …
## 6 518 51-6091 0.88 35420 High school dip… Extruding…
## 7 427 51-4031 0.78 34210 High school dip… Cutting, …
## 8 228 41-4011 0.25 92910 Bachelor's degr… Sales Rep…
## 9 590 51-4032 0.94 38880 High school dip… Drilling …
## 10 584 51-9041 0.93 34370 High school dip… Extruding…
## # ℹ 692 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## # numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## # average_ann_wage <dbl>
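The jobs data has no missing values in prob, so the two results above are not very exciting (zero rows, then all 702). Here is a small sketch on a made-up dataframe that does contain an NA, including the common mistake of comparing against NA with ==.

```r
library(dplyr)

# Toy data with one missing value, invented for illustration
df <- data.frame(
  job  = c("a", "b", "c"),
  prob = c(0.5, NA, 0.8)
)

# Keep only the rows where prob is missing
missing_rows <- filter(df, is.na(prob))

# Keep only the rows where prob is present
complete_rows <- filter(df, !is.na(prob))

# Common mistake: prob == NA evaluates to NA for every row,
# and filter() drops rows whose condition is NA, so nothing is kept
wrong <- filter(df, prob == NA)

print(nrow(missing_rows))
print(nrow(complete_rows))
print(nrow(wrong))
```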
And you can rename variables as you select them.
python equivalent jobs[['_ - code', '_ - rank']].rename(columns={'_ - code': 'code'})
R select(df, new_name = old_name)
select(jobs, code= '_ - code' , '_ - rank')
## # A tibble: 702 × 2
## code `_ - rank`
## <chr> <dbl>
## 1 51-4033 624
## 2 51-9012 517
## 3 41-4012 484
## 4 53-1031 105
## 5 51-4072 620
## 6 51-6091 518
## 7 51-4031 427
## 8 41-4011 228
## 9 51-4032 590
## 10 51-9041 584
## # ℹ 692 more rows
To rename just some columns but keep the whole dataframe:
python jobs.rename(columns={'_ - code': 'code', '_ - rank': 'rank'})
R rename(df, new_name = old_name)
rename(jobs, code='_ - code' , rank= '_ - rank')
## # A tibble: 702 × 13
## rank code prob `Average annual wage` education occupation
## <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 624 51-4033 0.95 34920 High school diploma or … Grinding,…
## 2 517 51-9012 0.88 41450 High school diploma or … Separatin…
## 3 484 41-4012 0.85 68410 High school diploma or … Sales Rep…
## 4 105 53-1031 0.029 59800 High school diploma or … First-Lin…
## 5 620 51-4072 0.95 32660 High school diploma or … Molding, …
## 6 518 51-6091 0.88 35420 High school diploma or … Extruding…
## 7 427 51-4031 0.78 34210 High school diploma or … Cutting, …
## 8 228 41-4011 0.25 92910 Bachelor's degree Sales Rep…
## 9 590 51-4032 0.94 38880 High school diploma or … Drilling …
## 10 584 51-9041 0.93 34370 High school diploma or … Extruding…
## # ℹ 692 more rows
## # ℹ 7 more variables: `short occupation` <chr>, len <dbl>, probability <dbl>,
## # numbEmployed <dbl>, median_ann_wage <dbl>, employed_may2016 <chr>,
## # average_ann_wage <dbl>
You can select columns whose names contain a given string. The example below selects only the columns with "X" in their name; the readr import has none, so the result has zero columns.
R select(df, contains('value'))
select(jobs, contains("X"))
## # A tibble: 702 × 0
names(jobs)
## [1] "_ - rank" "_ - code" "prob"
## [4] "Average annual wage" "education" "occupation"
## [7] "short occupation" "len" "probability"
## [10] "numbEmployed" "median_ann_wage" "employed_may2016"
## [13] "average_ann_wage"
Don’t forget that you have to assign the result back to the variable if you want to overwrite it.
jobs <-rename(jobs, code= '_ - code' , rank= '_ - rank', avg_ann_wage = "Average annual wage")
We can sort values in a dataframe with the function arrange().
It takes a data frame and a set of column names (or more complicated expressions) to order by.
Here, we sort by probability first, and if there is a tie, education level breaks the tie.
python jobs.sort_values(['prob', 'education'], ascending=True)
r arrange(df, column_name)
arrange(jobs, prob,education)
## # A tibble: 702 × 13
## rank code prob avg_ann_wage education occupation `short occupation` len
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 29-1… 0.0028 48190 Bachelor… Recreatio… Recreational Ther… 23
## 2 3 11-9… 0.003 78060 Bachelor… Emergency… Emergency Managem… 30
## 3 2 49-1… 0.003 66730 High sch… First-Lin… First-Line Superv… 61
## 4 4 21-1… 0.0031 47880 Bachelor… Mental He… Mental Health and… 48
## 5 5 29-1… 0.0033 79290 Doctoral… Audiologi… Audiologists 12
## 6 7 29-2… 0.0035 69920 Master's… Orthotist… Orthotists and Pr… 27
## 7 8 21-1… 0.0035 55510 Master's… Healthcar… Healthcare Social… 25
## 8 6 29-1… 0.0035 83730 Master's… Occupatio… Occupational Ther… 23
## 9 9 29-1… 0.0036 232870 Doctoral… Oral and … Oral and Maxillof… 31
## 10 10 33-1… 0.0036 77050 Postseco… First-Lin… First-Line Superv… 62
## # ℹ 692 more rows
## # ℹ 5 more variables: probability <dbl>, numbEmployed <dbl>,
## # median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>
r arrange(df, desc(column_name))
arrange(jobs, desc(prob))
## # A tibble: 702 × 13
## rank code prob avg_ann_wage education occupation `short occupation` len
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 694 51-91… 0.99 31740 High sch… Photograp… Photographic Proc… 61
## 2 701 23-20… 0.99 51490 High sch… Title Exa… Title Examiners, … 42
## 3 696 43-50… 0.99 44250 High sch… Cargo and… Cargo and Freight… 24
## 4 699 15-20… 0.99 58490 Bachelor… Mathemati… Mathematical Tech… 24
## 5 698 13-20… 0.99 75480 Bachelor… Insurance… Insurance Underwr… 22
## 6 692 25-40… 0.99 34780 Postseco… Library T… Library Technicia… 19
## 7 693 43-41… 0.99 36480 High sch… New Accou… New Accounts Cler… 19
## 8 691 43-90… 0.99 31640 High sch… Data Entr… Data Entry Keyers 17
## 9 697 49-90… 0.99 39720 High sch… Watch Rep… Watch Repairers 15
## 10 695 13-20… 0.99 45340 High sch… Tax Prepa… Tax Preparers 13
## # ℹ 692 more rows
## # ℹ 5 more variables: probability <dbl>, numbEmployed <dbl>,
## # median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>
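Tie-breaking is easier to see on a handful of rows. This is a minimal sketch with a made-up dataframe (df, prob, educ and id are invented for illustration) where two rows share the same prob, so the second column decides their order.

```r
library(dplyr)

# Toy data invented for illustration; rows 1 and 3 tie on prob
df <- data.frame(
  id   = 1:3,
  prob = c(0.5, 0.1, 0.5),
  educ = c("b", "c", "a")
)

# Ascending by prob, ties broken by educ (alphabetical)
asc <- arrange(df, prob, educ)

# Descending by prob, ties still broken by ascending educ
dsc <- arrange(df, desc(prob), educ)

print(asc$id)   # row order after the ascending sort
print(dsc$id)   # row order after the descending sort
```

Note that desc() wraps one column at a time, so you can mix ascending and descending columns in the same call.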
You may want to add new columns that are functions of existing columns - the function for that is mutate().
By default, mutate() adds new columns at the end of your dataset.
You can create a whole variety of new variables, as in python. Here are some useful tips on this:
Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length.
Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). This is useful for regressions with time series.
Logical comparisons: <, <=, >, >=, !=, and ==. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
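A few of the helpers in the list above, tried on a tiny made-up column so you can check the results by hand. The dataframe df and the column x are invented for illustration.

```r
library(dplyr)

# Toy column invented for illustration
df <- data.frame(x = c(5, 12, 12, 20))

out <- mutate(df,
  tens      = x %/% 10,           # integer division: 0, 1, 1, 2
  units     = x %% 10,            # remainder: 5, 2, 2, 0
  change    = x - lag(x),         # running difference: NA, 7, 0, 8
  rank_asc  = min_rank(x),        # smallest value gets rank 1: 1, 2, 2, 4
  rank_desc = min_rank(desc(x))   # largest value gets rank 1: 4, 2, 2, 1
)

print(out)
```

The first value of change is NA because there is no earlier row for lag() to look back to.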
Here, we can see that the new variable diff is added on at the end.
python jobs['diff'] = jobs['avg_ann_wage'] - jobs['median_ann_wage']
r
mutate(jobs,
diff = avg_ann_wage - median_ann_wage)
## # A tibble: 702 × 14
## rank code prob avg_ann_wage education occupation `short occupation` len
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 624 51-40… 0.95 34920 High sch… Grinding,… Tool setters, ope… 35
## 2 517 51-90… 0.88 41450 High sch… Separatin… Tool setters, ope… 35
## 3 484 41-40… 0.85 68410 High sch… Sales Rep… Sales Representat… 92
## 4 105 53-10… 0.029 59800 High sch… First-Lin… Supervisors Trans… 26
## 5 620 51-40… 0.95 32660 High sch… Molding, … Molding, Coremaki… 89
## 6 518 51-60… 0.88 35420 High sch… Extruding… Extruding and For… 88
## 7 427 51-40… 0.78 34210 High sch… Cutting, … Cutting, Punching… 85
## 8 228 41-40… 0.25 92910 Bachelor… Sales Rep… Sales Representat… 85
## 9 590 51-40… 0.94 38880 High sch… Drilling … Drilling and Bori… 82
## 10 584 51-90… 0.93 34370 High sch… Extruding… Extruding, Formin… 82
## # ℹ 692 more rows
## # ℹ 6 more variables: probability <dbl>, numbEmployed <dbl>,
## # median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>,
## # diff <dbl>
If you only want to keep the new variables, use transmute().
python diff = jobs['avg_ann_wage'] - jobs['median_ann_wage']
transmute(jobs,
diff = avg_ann_wage - median_ann_wage)
## # A tibble: 702 × 1
## diff
## <dbl>
## 1 2030
## 2 3090
## 3 11270
## 4 2530
## 5 2180
## 6 1180
## 7 1840
## 8 13930
## 9 2470
## 10 1860
## # ℹ 692 more rows
You can combine mutate with boolean filters
jobs %>% filter(occupation %in% c('Economists')) %>% mutate(
diff = avg_ann_wage - median_ann_wage)
## # A tibble: 1 × 14
## rank code prob avg_ann_wage education occupation `short occupation` len
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 282 19-3011 0.43 112860 Master's… Economists Economists 10
## # ℹ 6 more variables: probability <dbl>, numbEmployed <dbl>,
## # median_ann_wage <dbl>, employed_may2016 <chr>, average_ann_wage <dbl>,
## # diff <dbl>
We can get simple summary statistics of our dataframe, like we did in pandas with describe (but it is a bit more complicated).
In R, the function uses the British spelling, `summarise()`
*Technically, it works with a z, too.
This is the mean probability of the entire dataset, excluding any values that are NA:
summarise(jobs, prob = mean(prob, na.rm=TRUE))
## # A tibble: 1 × 1
## prob
## <dbl>
## 1 0.536
We can use summarise() in conjunction with group_by(), which is the same process as groupby in pandas. It splits the data into the groups you need, and then summarise() applies a statistic to each group.
We can create multiple new summary columns from one grouped dataframe:
by_educ <- group_by(jobs, education)
educ_wage <-summarise(by_educ, av_wage_educ = mean(avg_ann_wage, na.rm=TRUE),
count = n())
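The code above saves its result without printing it, so here is the same split-then-summarise pattern on a made-up dataframe small enough to verify by hand. The names df, educ and wage are invented for illustration.

```r
library(dplyr)

# Toy data invented for illustration; one wage is missing
df <- data.frame(
  educ = c("HS", "HS", "BA", "BA", "BA"),
  wage = c(30, 40, 60, NA, 80)
)

# Step 1: split the rows into groups
by_educ <- group_by(df, educ)

# Step 2: compute one row of statistics per group
stats <- summarise(by_educ,
  mean_wage = mean(wage, na.rm = TRUE),  # the NA is skipped inside its group
  count     = n()                        # counts every row, including the NA one
)

print(stats)
```

Notice that count for the BA group is 3 even though only two wages went into its mean: n() counts rows, not non-missing values.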
Merge dataframes together. Like pandas, you can join dataframes as left, right, inner or outer (an outer join is called a full join in dplyr). There are similar combinations with some exceptions. You can check out the documentation here for more details.
The general format is:
join_type(df1, df2, by = c("key1_name", "key2_name"))
by_educ <- group_by(jobs, education) # Create another dataframe grouped by education
educ_emp <- summarise(by_educ, Nemp = sum(numbEmployed, na.rm=TRUE)) # That contains the number of employees by education group
python
pd.merge(educ_emp, educ_wage, on='education', how='left')
R
left_join(educ_emp,educ_wage, by= c("education"))
## # A tibble: 8 × 4
## education Nemp av_wage_educ count
## <chr> <dbl> <dbl> <int>
## 1 Associate's degree 2993610 56492. 44
## 2 Bachelor's degree 25946820 80602. 155
## 3 Doctoral or professional degree 2424010 126743. 23
## 4 High school diploma or equivalent 49420870 44011. 307
## 5 Master's degree 1926250 75966. 29
## 6 No formal educational credential 38642320 33031. 98
## 7 Postsecondary nondegree award 6823230 48555. 42
## 8 Some college, no degree 2981570 44516. 4
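In the example above both dataframes have exactly the same education groups, so every join type would give the same 8 rows. Here is a sketch on two made-up dataframes whose keys only partly overlap, which is where the join types start to differ. The names emp, wage_tbl, n_emp and avg_wage are invented for illustration.

```r
library(dplyr)

# Two toy dataframes invented for illustration; the keys only partly overlap
emp      <- data.frame(educ = c("HS", "BA", "PhD"), n_emp    = c(100, 50, 5))
wage_tbl <- data.frame(educ = c("HS", "BA", "MA"),  avg_wage = c(35, 70, 75))

# Left join: every row of emp is kept; PhD has no match, so avg_wage is NA
l <- left_join(emp, wage_tbl, by = "educ")

# Inner join: only keys present in both dataframes (HS and BA)
i <- inner_join(emp, wage_tbl, by = "educ")

# Full join (pandas how='outer'): every key from either dataframe
f <- full_join(emp, wage_tbl, by = "educ")

print(l)
print(i)
print(f)
```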
I want you to start to get familiar with R already and deal with any troubleshooting issues that you might have.
Resources:
Vignette (from the tidyr package)
Original paper (Hadley Wickham, 2014 JSS) <- this is the same author as the book I mentioned earlier